Thomas (Tom) E. White
- Email: thomas.white@sydney.edu.au
- Office: A08 337
- Lab: The Sensory and Evolutionary Ecology (SEE) Lab
The Sensory and Evolutionary Ecology (SEE) Lab
The evo-ecology of information
Behaviour
- Communication
- Perception
- Decision-making
Evo-Ecology
- Sexual selection
- Insect <-> plant
- Predator <-> prey
Meta-science
- Tools and methods
- Meta-analysis
- Evidence synthesis
codeRs?
- Nice to keep in touch, hear what’s happening
- Fun to learn stuff from and with others
- Have a spot (physical & online) to ask for ideas, help, direction
- Food
- Ask each
- Total work-in-progress, will evolve continuously, so ideas welcome!
- solescodeRs.github.io
- Slack
Tidying up: Outcomes
- Understand the principles and importance of reproducibility in science
- Learn the key steps in producing reproducible research
- Create and detail the structure and value of ‘tidy’ projects, data, and code
Science is open & robust
Features of robust science
Reproducible: The same result can be independently reached given the same data & analysis pipeline.
Replicable: The same result can be independently reached given independent data & analysis pipeline.
Why conduct reproducible science?
Huge practical benefits!
- Easier to share and reuse, as projects are richly documented and detailed, which is the point of science!
- Newly-collected data can be integrated easily into existing projects and analyses
- Mistakes are easier to detect and remedy
- Documents (e.g. manuscripts) are easier to revise & update as data and/or analyses change
- The exact steps you took to produce an end-result (e.g. a manuscript) will be richly documented and self-contained forever & ever
- You can more easily steal from yourself & minimise the duplication of effort into the future
- Easier to collaborate (once everyone’s on-board)
Practical principles for reproducible research
Principle 1: Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why. (N.B. ‘Someone’ includes future-you.)
Principle 2: Everything you do, you will probably have to do over again.
Practical steps to reproducible research
- Tidy projects
- Tidy data
- Tidy code
Tidy projects
What is a tidy project?
Source: phil_g
- Hard to find anything
- What’s canonical?
- What’s important?
- Errors guaranteed & tough to trace
- Unshareable
- It’s just awful
Three tips for tidy projects
- Make it self-contained
- Create a consistent, sensibly-named directory structure
- Include a readme, documenting the layout & contents
For example
- informative_project_name
  - README.txt (a readme text file at the top level of the directory which outlines the broad structure/details of the project)
  - /data (raw data, such as images or videos, as well as the processed products for analysis)
  - /doc (all notes and the draft manuscript associated with the project)
  - /figs (figures to be included in the manuscript, typically generated via code)
  - /output (programmatically-generated output from data handling and analysis such as tables of statistical results, which can be re-generated at any time)
  - /R (code for processing and analysing data)
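A layout like this can be scaffolded with a few lines of base R. This is only a sketch: `create_tidy_project()` is a hypothetical helper (not from any package), the README wording is a placeholder, and the demo builds the project in a temporary directory.

```r
# Hypothetical helper: create the example directory layout under `root`.
create_tidy_project <- function(root) {
  subdirs <- c("data", "doc", "figs", "output", "R")
  for (d in subdirs) {
    dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
  }
  # Placeholder readme documenting the layout and contents
  readme <- c(
    paste("Project:", basename(root)),
    "data/    raw data and processed products for analysis",
    "doc/     notes and the draft manuscript",
    "figs/    figures for the manuscript, generated via code",
    "output/  programmatically-generated output (can be re-generated)",
    "R/       code for processing and analysing data"
  )
  writeLines(readme, file.path(root, "README.txt"))
  invisible(root)
}

# Demo in a temporary directory so nothing real is touched
project_dir <- file.path(tempdir(), "informative_project_name")
create_tidy_project(project_dir)
list.files(project_dir)
```

Re-running the helper is harmless for the directories, but it overwrites README.txt, so it is best used once at project creation.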
Tidy data
What is tidy data?
Eight golden rules for data organisation
- Each variable forms a column, each observation a row
- Use plain text
- Choose good names
- No empty cells
- Use metadata
- Treat raw data as read-only
- Be consistent
- Dates are awful
1. Each variable a column, observation a row
[Figure: a messy table and its tidy equivalent. Source: Wickham (2009) & Pew Research Center]
[Figure: a second messy table and its tidy equivalent. Source: Wickham (2009)]
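To make the rule concrete, here is a minimal sketch of reshaping a messy (wide) table into a tidy (long) one with `tidyr::pivot_longer()`. The data are made up for illustration (they are not the tables from the slides), and the column names `plant_id` and `height_cm` are assumptions.

```r
library(tidyr)

# Messy: the 'year' variable is hidden in the column names (made-up data)
messy <- data.frame(
  plant_id    = c("p1", "p2"),
  height_2019 = c(10.2, 8.1),
  height_2020 = c(12.5, 9.4)
)

# Tidy: each variable (plant_id, year, height_cm) is a column,
# each observation (one plant in one year) is a row
tidy <- pivot_longer(
  messy,
  cols         = c(height_2019, height_2020),
  names_to     = "year",
  names_prefix = "height_",
  values_to    = "height_cm"
)
tidy
```

The tidy version has four rows (two plants by two years) and can be filtered, grouped, or plotted by `year` directly.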
2. Use plain text
Microsoft Excel through the ages
.xls, .xlt, .xlm, .xlam, .xltm, .xlsx, .xltx, ...
Text through the ages
.txt
Types of text file
- .csv: comma-separated values. Great all-purpose format.
- .txt or .tsv: plain-text/tab-delimited.
- Future-proof
- Can be opened with anything/anywhere
3. Choose good names
Untidy
- myabstract.docx
- Tom’s best ideas.docx
- figure 1.png
- newNEWv2_dontdelete_forREAL_dont_FINALfinal_v2.xlsx
Tidy
- 2020_abstract_for_hons_conf.docx
- toms_ideas.docx
- fig_01_scatterplot_length_width.png
- 2019-08-07_raw_data_LIFE4000.xlsx
Good names are
- Machine readable
  - No special characters or formatting, such as ! @ # $ % ^ & * ( ) ~ + =
  - Use _ and - instead: separating_metadata and splitting-up-words
- Human readable
  - Names contain information on content
  - Untidy: data 1.csv
  - Tidy: 2020-08-09_field-data_heights-weights.csv
- Nicely ordered
  - Think about sorting
  - Chronological
    2020-08-09_field-data_heights-weights.csv
    2020-08-12_field-data_heights-weights.csv
    2020-08-18_field-data_heights-weights.csv
  - Logical
    01_load_functions.R
    02_clean_data.R
    03_analysis.R
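File names that follow these rules can be generated rather than typed. The sketch below uses only base R; `make_file_name()` is a hypothetical helper written for this example, not a function from any package.

```r
# Hypothetical helper: ISO date first, '_' between metadata fields,
# '-' between words within a field.
make_file_name <- function(date, ..., ext = "csv") {
  parts <- c(format(as.Date(date), "%Y-%m-%d"), ...)
  paste0(paste(parts, collapse = "_"), ".", ext)
}

make_file_name("2020-08-09", "field-data", "heights-weights")
# "2020-08-09_field-data_heights-weights.csv"

# Zero-padded numeric prefixes sort in the intended (logical) order
scripts <- sprintf(
  "%02d_%s.R",
  c(3, 1, 2),
  c("analysis", "load_functions", "clean_data")
)
sort(scripts)
# "01_load_functions.R" "02_clean_data.R" "03_analysis.R"
```

Without the zero padding, a prefix such as `10_` would sort before `2_`, which is why the examples use `01`, `02`, `03`.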
4. No empty cells
Or special characters
##    cow_ID milk_volume weight
## 1     moo          12   1100
## 2   bumbo           2   1201
## 3    spot           ?   1084
## 4 jeffrey               1044
## 5    holy          16   1244
## 6   daisy           -   1093
Use NA if NA, or 0 if 0
##    cow_ID milk_volume weight
## 1     moo          12   1100
## 2   bumbo           2   1201
## 3    spot          NA   1084
## 4 jeffrey           0   1044
## 5    holy          16   1244
## 6   daisy           0   1093
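A short sketch of why this matters in practice, using the cow table from the slides typed in as text. With `?`, `-`, and blank cells, R cannot treat `milk_volume` as a number; with explicit NA and 0 values it can, and the missing value is handled deliberately.

```r
# Messy version: '?', '-' and an empty cell in milk_volume
messy_csv <- "cow_ID,milk_volume,weight
moo,12,1100
bumbo,2,1201
spot,?,1084
jeffrey,,1044
holy,16,1244
daisy,-,1093"
messy <- read.csv(text = messy_csv)
is.numeric(messy$milk_volume)  # FALSE: the column is no longer numeric

# Tidy version: NA where the value is unknown, 0 where it is zero
tidy <- data.frame(
  cow_ID      = c("moo", "bumbo", "spot", "jeffrey", "holy", "daisy"),
  milk_volume = c(12, 2, NA, 0, 16, 0),
  weight      = c(1100, 1201, 1084, 1044, 1244, 1093)
)
is.numeric(tidy$milk_volume)          # TRUE
mean(tidy$milk_volume, na.rm = TRUE)  # 6: the NA is skipped, the zeros count
```

Note that the decision between NA and 0 for each cell has to come from the person who collected the data; code cannot infer it.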
5. Use metadata
or a ‘data dictionary’
- Metadata = data about data
- A file describing the contents & structure of a separate file
- The richer & more detailed the better
- Essential to reproducibility (not least for yourself)
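A data dictionary can itself be a small plain-text table stored next to the data. The sketch below describes the cow table used earlier; the descriptions and units are assumptions made for illustration, and the file is written to a temporary directory.

```r
# One row per variable in the data file being described.
# Descriptions and units are illustrative assumptions.
data_dictionary <- data.frame(
  variable    = c("cow_ID", "milk_volume", "weight"),
  type        = c("character", "numeric", "numeric"),
  description = c(
    "Name identifying each cow",
    "Milk produced (units assumed: litres per day)",
    "Body weight (units assumed: kg)"
  )
)

# Save it as plain text alongside the data it documents
dict_path <- file.path(tempdir(), "cow_data_dictionary.csv")
write.csv(data_dictionary, dict_path, row.names = FALSE)
read.csv(dict_path)
```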
6. Treat raw data as read-only
Hands off!
Modify by hand (only when unavoidable)
- Create and work on a copy
- Document every change you make in a separate file
Modify via code (whenever possible)
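Modifying via code might look like the sketch below: the raw file is only ever read, the cleaning happens in R, and the result is written to a separate file. The file names and the tiny dataset are hypothetical, and everything happens in a temporary directory.

```r
# Set up a throwaway project with a small 'raw' file (made-up data)
project_dir <- file.path(tempdir(), "raw_data_demo")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
raw_path <- file.path(project_dir, "data", "raw_cows.csv")
writeLines(c("cow_ID,milk_volume", "moo,12", "spot,?"), raw_path)
raw_before <- readLines(raw_path)

# Clean in code: '?' is declared as missing when the file is read
dat <- read.csv(raw_path, na.strings = c("NA", "?"))

# Write the processed product to a new file; never overwrite the raw one
processed_path <- file.path(project_dir, "data", "processed_cows.csv")
write.csv(dat, processed_path, row.names = FALSE)

# The raw file is byte-for-byte what it was before
identical(readLines(raw_path), raw_before)  # TRUE
```

Because every change lives in the script, the processed file can be deleted and regenerated at any time.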
7. Be consistent
e.g. Naming conventions
- snake_case
- camelCase
- SCREAMING_SNAKE_CASE
- kebab-case
- Train-Case
8. Dates are awful
- MM/DD/YY
- DD/MM/YY
- YY/MM/DD
- DD-MM-YYYY
- MM-YY
- Not to mention Excel’s handling of them
Instead, split up the variables (e.g. separate year, month, and day columns).
Or if you must, use the ISO standard: YYYY-MM-DD
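Both options can be sketched in base R. The dates below are made up and assumed to be in DD/MM/YYYY format; the key step is stating the input format explicitly rather than letting software guess.

```r
# Made-up dates, assumed to be DD/MM/YYYY
dates_dmy <- c("07/08/2019", "12/08/2020")

# Parse with an explicit format
parsed <- as.Date(dates_dmy, format = "%d/%m/%Y")

# Option 1: the ISO standard, YYYY-MM-DD
format(parsed, "%Y-%m-%d")
# "2019-08-07" "2020-08-12"

# Option 2: split into separate, unambiguous variables
date_parts <- data.frame(
  year  = as.integer(format(parsed, "%Y")),
  month = as.integer(format(parsed, "%m")),
  day   = as.integer(format(parsed, "%d"))
)
date_parts
```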
R tools to help along the way
library(janitor)
- clean_names(): creates consistent, tidy-rule-following variable names
- remove_empty(): remove rows/columns/both containing missing or empty data
- convert_to_date(): take the fight to Excel’s concept of a date
library(tidyverse)
- Set of ~25 packages for cleaning/wrangling/visualising…
- tidyr: reshaping data
- dplyr: manipulating data
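As a quick illustration of the janitor helpers, the sketch below cleans a small made-up table whose column names contain spaces and which has one empty row and one empty column. It assumes the janitor package is installed.

```r
library(janitor)

# Made-up messy table: awkward names, an all-NA row, an all-NA column
messy <- data.frame(
  "Cow ID"      = c("moo", "bumbo", NA),
  "Milk Volume" = c(12, 2, NA),
  "Notes"       = c(NA, NA, NA),
  check.names   = FALSE
)

# Standardise the names, then drop rows/columns that are entirely empty
tidy <- remove_empty(clean_names(messy), which = c("rows", "cols"))
names(tidy)
dim(tidy)
```

The same two calls can sit at the top of a cleaning script, right after the raw data are read in.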
Tidy code
Four steps to code cleanliness
- Choose good names & be consistent
- Write human-readable code
- Keep it self-contained
- Keep it well-styled (and use help)
1. Choose good names & be consistent
Good:
dat_heights_2020 <- read.csv('2020_field_data_heights.csv')
Less good (maybe)
dat_field <- read.csv('2020_field_data_heights.csv')
Bad
dat <- read.csv('2020_field_data_heights.csv')
2. Write human-readable code
## ----------------- Load data ----------------- ##
library(dplyr)  # provides %>%, summarise() and n()
dat_heights_2020 <- read.csv('2020_field_data_heights.csv')  # Summer 2020
dat_heights_2021 <- read.csv('2021_field_data_heights.csv')  # Winter 2021
## ----------------- Summarise data ----------------- ##
# Calculate mean +- SD heights
# (assumes a numeric column called 'height'; the original does not name the column)
dat_heights_summary <- dat_heights_2020 %>%
  summarise(mean = mean(height),
            sd = sd(height),
            n = n())
Good
height <- cm * 6 + mm
mean(x, na.rm = TRUE)
Bad
height<-cm*6+mm
mean(x,na.rm=TRUE)
Good
do_something_very_complicated(
  something = "that",
  requires = many,
  arguments = "which may be long"
)
Bad
do_something_very_complicated("that", requires, many, arguments, "which may be long")
3. Keep it self-contained
- Forget setwd() exists
- Assume a script is being run from the ‘root’ of the project
- Use paths relative to that
Bad
data <- read.csv('C:/tomscomputer/projects/feeding_experiment/data/feeding_data.csv')
- Won’t run on any other computer
- Won’t run on my computer, if I ever move it or modify my filesystem
Good
data <- read.csv('data/feeding_data.csv')
Also see here::here()
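A minimal sketch of the relative-path approach in base R. It assumes the script is run from the project root and that the file sits at data/feeding_data.csv as in the example above; the object names are illustrative.

```r
# Build the path from its parts, relative to the project root
path_feeding <- file.path("data", "feeding_data.csv")
path_feeding
# "data/feeding_data.csv"

# Read the file only if it is where we expect it to be
if (file.exists(path_feeding)) {
  dat_feeding <- read.csv(path_feeding)
} else {
  message("Not found: ", path_feeding,
          " (is the script being run from the project root?)")
}

# With the 'here' package installed, the equivalent would be:
# dat_feeding <- read.csv(here::here("data", "feeding_data.csv"))
```

Because nothing in the path refers to a particular computer, the same script runs unchanged for collaborators who copy the whole project folder.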
4. Keep it well-styled (and use help)
styler::style_file()
Before
height<-cm*6+mm+2; mean(x,na.rm=TRUE)
After
height <- cm * 6 + mm + 2
mean(x, na.rm = TRUE)
Outcomes
- Understand the principles and importance of reproducibility in science
- Learn the key steps in producing reproducible research
- Create and detail the structure and value of ‘tidy’ projects, data, and code
Thanks!